ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data

Abstract Summary Recent innovations in single-cell chromatin accessibility sequencing (scCAS) have revolutionized the characterization of epigenomic heterogeneity. Estimation of the number of cell types is a crucial step for downstream analyses and biological implications. However, efforts to perform estimation specifically for scCAS data are limited. Here, we propose ASTER, an ensemble learning-based tool for accurately estimating the number of cell types in scCAS data. ASTER outperformed baseline methods in systematic evaluation on 27 datasets of various protocols, sizes, numbers of cell types, degrees of cell-type imbalance, cell states and qualities, providing valuable guidance for scCAS data analysis. Availability and implementation ASTER along with detailed documentation is freely accessible at https://aster.readthedocs.io/ under the MIT License. It can be seamlessly integrated into existing scCAS analysis workflows. The source code is available at https://github.com/biox-nku/aster. Supplementary information Supplementary data are available at Bioinformatics online.


Supplementary Texts
Text S1. Ideas of existing methods for scRNA-seq data To estimate the number of cell types in a given scRNA-seq dataset, a number of remarkable methodologies have been proposed (Yu, et al., 2022). These methodologies can be grouped into four main categories. The first category is based on the idea of community detection, which optimizes community structure to find the best possible grouping and is the most widelyused category in the single-cell data analysis community. The typical methods of this category are Louvain algorithm (Blondel, et al., 2008) and Leiden algorithm (Traag, et al., 2019), which have been adopted in almost all the conventional scRNA-seq data analysis workflows, such as Seurat (Hao, et al., 2021), Scanpy (Wolf, et al., 2018), and Monocle3 (Cao, et al., 2019).
We note that Louvain and Leiden also have been adopted in scCAS data analysis workflows, such as Signac (Stuart, et al., 2021), ArchR (Granja, et al., 2021) and EpiScanpy (Danese, et al., 2021). The second category is based on the intuitive idea that a good clustering should have high intra-cluster similarity (cells within a cluster are similar) and low inter-cluster similarity (cells from different clusters are dissimilar). For example, scLCA uses the silhouette coefficient to study the separation distance between the resulting clusters (Cheng, et al., 2019), CIDR adopts the Calinski and Harabasz score to measure the ratio of the sum of inter-cluster dispersion and the sum of intra-cluster dispersion for all clusters (Lin, et al., 2017), and RaceID is based on the Gap statistic that compares the change in intra-cluster dispersion with that expected under an appropriate reference null distribution (Grün, et al., 2015). The third category is based on the idea of eigengap heuristic, which suggests the number of cell types as the value that maximizes the eigengap (difference between consecutive eigenvalues).
There are several scRNA-seq data-specific methods belong to this category, such as SIMLR (Wang, et al., 2017), Spectrum (John, et al., 2020) and SC3 (Kiselev, et al., 2017). The fourth category is based on the idea that more stable and reproducible clustering results can be obtained by specifying the optimal number of cell types rather than by specifying a suboptimal number of cell types. The methods of this category are usually incorporated with ensemble learning strategies. For example, scCCESS assumes that clustering using the optimal number of clusters would be the most robust to small perturbations in the scRNA-seq data, and estimates the number of cell types by ensemble clustering (Yu, et al., 2022).
Although both scRNA-seq data and scCAS data can be expressed in the form of a matrix after preprocessing and the methods designed for scRNA-seq data can be applied to scCAS data theoretically, the characteristics of scCAS data make the application subject to various obstacles. For example, scRNA-seq data-specific methods become much more timeconsuming and memory-consuming when they are applied to high-dimension scCAS data, and thus can hardly be applied to large-scale scCAS data from the technical perspective.

Text S3. TF-IDF transformation
We apply term frequency-inverse document frequency (TF-IDF) transformation (V1) to peak ! and cell " as: which represents how important the peak ! is for cell ", and is then normalized as a recent study for scCAS data estimation (Li, et al., 2021): For TF-IDF transformation (V2), we first weight all the peaks in individual cells by the 'term frequency', which is the total number of accessible peaks in that cell, and then multiply the weighted matrix by log(1 + inverse document frequency), which is the inverse frequency of each peak across all cells (Chen, et al., 2021;Cusanovich, et al., 2018). This step normalizes for sequencing depth, upweights peaks that do not occur very frequently, and downweights prevalent peaks.

Text S4. Baseline methods
Existing methods have mainly focused on the estimation of the number of cell types in scRNAseq data, whereas efforts to accurately estimate the number of cell types in single-cell chromatin accessibility (scCAS) data are still limited. Here, we compared the performance of ASTER with four baseline methods, including Louvain, Leiden, scCCESS and scLCA.
Specifically, Louvain and Leiden are two community detection-based techniques to find the best possible grouping, and have been adopted in almost all the widely-used scCAS data analysis workflows, e.g., Signac (Stuart, et al., 2021), ArchR (Granja, et al., 2021) and EpiScanpy (Danese, et al., 2021). scCCESS is a stability-based approach for estimating the number of cell types by taking advantage of a random sampling-based ensemble deep clustering model (Geddes, et al., 2019;Yu, et al., 2022). scLCA is a machine learning based analytical pipeline that combines similarity measurement by latent cellular states with a graphbased clustering algorithm, and provides heuristic solutions for population number inference (Cheng, et al., 2019). scCCESS and scLCA have provided the overall best performance on the estimation of the number of cell types in scRNA-seq data in the most recent benchmark study (Yu, et al., 2022). Following the standard EpiScanpy analysis workflow for scCAS data, we performed Louvain and Leiden with default resolution to estimate the number of cell types in scCAS data. We performed scCCESS with SIMLR (Wang, et al., 2017) following the tutorial recommended by the benchmarking study (Yu, et al., 2022). We performed scLCA following the tutorial and fed the batch information into the method for the datasets with cells from multiple batches, considering that scLCA provides the option of batch effect correction. The range of optional I (I > 1) was set as ten values upstream and downstream of the true number of cell types to be evaluated to find the optimal number. All the benchmark experiments were performed on a machine with two Intel XeonPlatinum 8375C CPUs and 256GB of RAM.

Text S5. Data collection and processing
We collected 27 scCAS datasets generated from different species with 6 different protocols, and with various sizes, dimensions, numbers of batches, numbers of cell types, degrees of cell-type imbalance, cell states, and levels of sparsity for systematic benchmarking.
First, we collected human hematopoietic cells with donor label BM0828 from a bone marrow single-cell ATAC-seq (Fluidigm C1) dataset, and referred the dataset as BM0828BoneMarrow (Buenrostro, et al., 2018). We also collected CLP/CMP/MPP, a subset of bone marrow cells from 4 donors, and BoneMarrowA, the entire dataset of bone marrow cells from 7 donors, from the same study (Buenrostro, et al., 2018). All these three datasets provide cell type labels after fluorescent activated cell sorting. To further test the performance of various methods for differentiating cells, we then collected Melanoma, a dataset of cells in time series after knockdown of SOX10 in melanoma cell lines of two short-term patient cultures (Bravo Gonzalez-Blas, et al., 2019). To assess the performance on large-scale scCAS data, we collected BoneMarrowB, a dataset containing 136,463 cells of 14 differentiating cell types from 2 batches via dsciATAC-seq (Lareau, et al., 2019). In addition to differentiating cell states, we also collected three differentiated human cell-line mixtures (CellLineMixtureA, CellLineMixtureB and CellLineMixtureC) generated by scATAC-seq with Fluidigm C1 or combinatorial cellular indexing to further systematically demonstrate the performance of different methods (Buenrostro, et al., 2015;Cusanovich, et al., 2015).
Second, to investigate if the findings are comparable across different species and data sources, we collected a dataset of mouse splenocytes (referred to as Splenocyte) generated by a plate-based scATAC-seq method after red blood cell removal (Chen, et al., 2018).
Besides, we also collected Forebrain, a dataset derived from mouse forebrain by singlenucleus ATAC-seq to test the performance of various methods on cells from complex tissues (Preissl, et al., 2018). Given that most scCAS experiments generate data that capture cell types with different numbers of cells, sometimes with a high degree of cell-type imbalance, we further collected all the 17 datasets in Mouse sci-ATAC-seq Atlas (Cusanovich, et al., 2018) to test the impact of imbalanced ratios of cells among different cell types. The datasets were profiled by a combinatorial indexing assay (sci-ATAC-seq), contain both differentiating and differentiated tissues, and have different levels of sparsity (Cusanovich, et al., 2018).
We used the number of cell types identified in the original studies of the datasets as ground truth for benchmarking. The cell types were identified by experimental fluorescent activated cell sorting (e.g., datasets of BM0828BoneMarrow, CLP/CMP/MPP, and BoneMarrowA), by the various cell lines used to computationally construct the scCAS dataset (e.g., datasets of CellLineMixtureA, CellLineMixtureB, and CellLineMixtureC), or by unsupervised cell clustering followed by manually assigning the putative cell-type label to each cluster (e.g., datasets of BoneMarrowB, Forebrain, and Heart). The cell-type labels were considered reasonable and have been used to assess unsupervised cell clustering performance and supervised cell type annotation performance (Chen, et al., 2019;Chen, et al., 2021;Chen, et al., 2022;Liu, et al., 2021;Xiong, et al., 2019;Zamanighomi, et al., 2018).
A summary of the collected 27 scCAS datasets for benchmarking is provided in Supplementary Table S1. The imbalance degree of a dataset is defined by estimating the normalized entropy of the cell-type size distribution as follows: Similar to the existing scCAS data analysis methods (Chen, et al., 2021;Chen, et al., 2022;Liu, et al., 2021;Xiong, et al., 2019;Zamanighomi, et al., 2018), we selected peaks/regions that have at least one read count in at least 1% of the cells in the scCAS count matrix to reduce the noise level. For the downsampling procedure, we randomly dropped out the non-zero entries in the data matrix to zero with a probability equal to the dropout rate as (Chen, et al., 2021;Chen, et al., 2022).

Analysis of computational time and memory usage
With the increase of scCAS throughput and data scale, the computational approach should meet the basic requirement of efficiency and scalability. We also benchmarked the average computational time and peak memory usage of different methods. Note that both scCCESS and scLCA reported memory errors when running on BoneMarrowB, the largest real scCAS dataset used in this study. We thus subsampled 70% of cells from BoneMarrowB, and failed to perform these two baseline methods again, which indicates that scRNA-seq data-specific methods can hardly be applied to scCAS data, whose dimensions are usually dozens of times higher than that of scRNA-seq data.
We next randomly subsampled 50% cells from BoneMarrowB, repeatedly performed each method 10 times one-by-one on the machine with two Intel XeonPlatinum 8375C CPUs and 256GB of RAM, and recorded the average computational time and peak memory usage. As shown in Supplementary Fig. S1, although fast for computing, scLCA was the most memoryconsuming method compared to others and may prohibit its application to large-scale datasets.
In contrast, scCCESS required relatively less memory while consuming much more computational time, highlighting a trade-off between memory usage and computational efficiency for the methods. Leiden and Louvain, two similar and straightforward clustering methods, took the least amount of time and had the lowest memory usage. The computational time and memory usage of ASTER were acceptable for conventional personal computers and were moderate among all the methods, which was reasonable since ASTER is based on ensemble strategies. We also evaluated the estimation performance of different methods on this subsampled BoneMarrowB dataset. ASTER accurately estimated the number of cell types in this dataset, while Leiden, Louvain, scCCESS and scLCA had estimation errors of -5, -6, 5 and -9, respectively, and estimation deviations of -35.7%, -42.9%, 35.7% and -64.3%, respectively. Together, the above results suggest that ASTER has satisfactory scalability on the premise of accurate estimation of the number of cell types in scCAS data.

Advancement of ASTER on 27 scCAS datasets
We used 27 datasets generated from different protocols, and with various sizes, dimensions, numbers of batches, numbers of cell types, degrees of cell-type imbalance, cell states, and levels of sparsity for systematic benchmarking. As shown in Fig. 1b, ASTER accurately estimated the number of cell types in 19 of the 27 datasets and had low absolute estimation deviation on the other 8 datasets, providing superior and robust performance. However, scCCESS and scLCA, two of the state-of-the-art methods specific to scRNA-seq data (Yu, et al., 2022), almost always provided over-estimation (positive deviation) or under-estimation (negative deviation) in the experiments and had poor scalability to the scale of scCAS data.
Besides, ASTER had average reductions of 62.5% and 92.1% in estimation deviation compared to scCCESS and scLCA, respectively ( Supplementary Fig. S2). We further showed the significance of the advantages of ASTER by conducting one-sided paired Wilcoxon signed-rank tests to test if ASTER achieves significantly lower absolute estimation deviation than another method. As shown in Fig. 1c, the absolute estimation deviation of ASTER was significantly lower than all the baseline methods, indicating that the state-of-the-art methods specific to scRNA-seq data no longer have advantages in scCAS data.
We also benchmarked the performance of the widely-used Louvain and Leiden methods to estimate the number of cell types, following the standard EpiScanpy analysis workflow for scCAS data. Note that these two strategies have been adopted in almost all the widely-used scCAS data analysis workflows, e.g., Signac (Stuart, et al., 2021), ArchR (Granja, et al., 2021) and EpiScanpy (Danese, et al., 2021). Since these two straightforward strategies are based on the results of data processing and dimension reduction for scCAS data, they can be regarded as methods specific to scCAS data. However, as shown in Fig. 1b, these two methods also achieved only limited performance to estimate the number of cell types in scCAS data and the performance fluctuated greatly across different datasets. Again, ASTER provided significantly lower absolute estimation deviation than these two methods ( Fig. 1c and Supplementary Fig. S2), indicating that the widely-used methods specific to scCAS data still have a lot of room for improvement.
Given that ASTER is also based on Louvain and Leiden, we next looked deep into the benchmarking datasets. ASTER_DB_SC, which is based on the Davies-Bouldin index and the silhouette coefficient, achieved significantly lower absolute estimation deviation than Louvain and Leiden with P-values of 1.35e-5 and 9.05e-6, respectively. ASTER_WSS_SC, which is based on the WSS criterion and the silhouette coefficient, achieved significantly lower absolute estimation deviation than Louvain and Leiden with P-values of 3.01e-4 and 9.21e-4, respectively. The results suggest that the ensemble strategies makes ASTER robust to the results from Louvain and Leiden algorithms.

Performance on imbalanced scCAS datasets
We have shown the performance of ASTER on datasets with various imbalance degrees ( To mimic datasets with rare cell types, we used simATAC, a scCAS data simulation framework, to simulate cells of various cell types (Navidi, et al., 2021). More specifically, we used the dataset of BM0828BoneMarrow as the input of simATAC. BM0828BoneMarrow contains seven hematopoietic cell types which were identified by fluorescent activated cell sorting and can be regarded as ground truth labels. We simulated 50 CLP cells (common lymphoid progenitors) and simulated 500 cells for each of the remaining six cell types, resulting in a dataset of 3050 cells with a rare cell type that accounts for only 1.639%. The estimation result of ASTER was 7, while the results of Leiden, Louvain, scCCESS and scLCA were 9, 8, 12 and 7, respectively. The results indicate that ASTER and scLCA accurately estimated the number of cell types, while Leiden, Louvain and scCCESS tended to provide over-estimation when there is only one rare cell type in the scCAS data. We next simulated a dataset of 10 000 cells containing 9700 CLP cells and 50 cells for each of the remaining six cell types. The imbalance degree of this dataset is 0.903 and is higher than that of all the 27 benchmarking datasets.
The estimation result of ASTER was 7, while the results of Leiden, Louvain, scCCESS and scLCA were 9, 8, 4 and 2, respectively. The results indicate that ASTER again accurately estimated the number of cell types, Leiden and Louvain again tended to provide overestimation, while scCCESS and scLCA tended to provide under-estimation when there is only one major cell type in the scCAS data.
Taken together, the above results suggest that the ensemble-based ASTER method can effectively deal with size imbalance of clusters, e.g., from rare cell types, compared to the baseline methods.

Performance on large scCAS datasets
We have benchmarked the estimation performance of different methods on BoneMarrowB, a dataset containing 136,463 cells of 14 differentiating cell types from 2 batches via droplet single-cell assay for transposase-accessible chromatin using sequencing (dsciATAC-seq) (Lareau, et al., 2019). However, scCCESS and scLCA reported memory errors when running on this dataset. To further evaluate the methods on large datasets, we collected 37,818 cells profiled via a massively parallel droplet-based method in primary tumor biopsies from 7 patients with basal cell carcinoma (BCC) (Satpathy, et al., 2019). There are 20 tumor microenvironment (TME) cell types in this dataset. We again encountered memory errors To evaluate the methods on much larger datasets, we used simATAC (Navidi, et al., 2021) to simulate cells with ground-truth cell-type labels based on the dataset of BoneMarrowA, which contains 10 hematopoietic cell types. Specifically, we simulated 20,000 cells for each cell type, and obtained a dataset of 200,000 cells. As expected, scCCESS and scLCA failed to work on this large dataset again. ASTER accurately estimated the number of cell types in this dataset, while both Leiden and Louvain provided over-estimation with error of 7 and deviation of 70%.
The results suggest that ASTER can scale well to large datasets.

Performance on high-noise scCAS datasets
To mimic protocols that generate sparser scCAS data, we downsampled the reads in BoneMarrowA. We have looked deep into the performance of Louvain and Leiden when the dropout rate varies from 5% to 90%, and we found an interesting phenomenon that with the dropout rate increases, Louvain and Leiden constantly under-estimated the number of cell types and then over-estimated the number of cell types after reaching an inflection point.
Specifically, the number estimated by either Louvain or Leiden had an obvious decreasing trend when the dropout rate varied from 5% to 75%, and had an obvious increasing trend when the dropout rate varied from 75% to 90%. To quantify the decreasing and increasing trends, we adopted the Pearson correlation coefficient to measure the linear relationship between the dropout rate and the estimated number of cell types. When the dropout rate ranged from 5% to 75%, the Pearson correlation coefficients between the dropout rate and the numbers estimated by Louvain, Leiden and ASTER_SC were -0.946, -0.947 and 0.274, respectively, and the P-values of tests (the null hypothesis is the distributions underlying the samples are uncorrelated and normally distributed) were 9.46e-8, 8.76e-8 and 3.23e-1, respectively. When the dropout rate ranged from 75% to 90%, the Pearson correlation coefficients between the dropout rate and the numbers estimated by Louvain, Leiden and ASTER_SC were 0.996, 0.910 and -0.258, respectively, and the P-values of tests (the null hypothesis is the distributions underlying the samples are uncorrelated and normally distributed) were 4.09e-3, 8.97e-2 and 7.42e-1, respectively. Therefore, the reason Louvain and Leiden had lower performance when the dropout rate was low but relatively higher performance when it was high is that either Louvain or Leiden began with over-estimation, and had two chances to achieve more accurate estimation since the estimated number first decreased and then increased when the dropout rate varied from 5% to 90%. Besides, we noticed that ASTER_SC, a branch of ASTER that is based on Louvain and Leiden algorithms, was robust to the dropout rate, which again suggests the advantages of our ensemble strategies.

Text S7. Model ablation analysis
ASTER is based on the branches of the WSS criterion (WSS), the Davies-Bouldin index (DB), and the silhouette coefficient (SC). We also evaluated the performance of the ASTER variants with only a single branch (ASTER_WSS, ASTER_DB and ASTER_SC) or with the ensemble of two branches (ASTER_WSS_DB, ASTER_DB_SC and ASTER_WSS_SC). As shown in Supplementary Fig. S3, ASTER provided obviously lower absolute estimation deviation compared with the variants. ASTER had average reductions of 11.5%, 20.9%, 28.4%, 12.0%, 4.53% and 12.7% in estimation deviation compared to ASTER_WSS, ASTER_DB, ASTER_SC, ASTER_WSS_DB, ASTER_DB_SC and ASTER_WSS_SC, respectively ( Supplementary Fig. S4a). One-sided paired Wilcoxon signed-rank tests also demonstrated that the advantages of ASTER over the six variants were significant ( Supplementary Fig. S4b).
The results indicate that although based on scCAS data-specific processing approaches, the estimation strategies using either individual or a combination of two metrics cannot guarantee satisfactory performance, highlighting the contribution of ensemble strategies in ASTER.    P-values of one-sided paired Wilcoxon signed-rank tests that test if a method (one of the row names) achieves significantly lower absolute estimation deviation on the 27 datasets than another method (one of the column names). Note that ASTER_WSS failed to find the elbow point on several datasets.